Cloud Infrastructure

Governed By Agents: A Survey On The Role Of Agentic AI In Future Computing Environments

Murad, Nauman Ali, Baloch, Safia

arXiv.org Artificial Intelligence

The emergence of agentic Artificial Intelligence (AI), which can operate autonomously, pursue goals, and learn adaptively, signals a major shift in today's computing infrastructure. This study investigates how the characteristics of agentic AI models may reshape the architecture, governance, and operation of computing environments. Because agentic AI can be resource-efficient in processing and storage, it has the potential to reduce reliance on very large public cloud environments. These characteristics invite an assessment of a likely strategic migration of computing infrastructure away from massive public cloud services toward more locally distributed architectures such as edge and on-premises computing. Much of this migration would be driven by factors such as on-premises processing needs, smaller data footprints, and cost savings. The study examines how supporting AI autonomy could require a re-architecture of systems, a departure from today's governance models to manage increasingly autonomous agents, and an operational overhaul of processes across a diverse landscape that combines cloud, edge, and on-premises computing. Understanding how best to position agentic AI will be fundamental to navigating these intertwined decisions and the future state of computing infrastructure.


Cloud Infrastructure Management in the Age of AI Agents

Yang, Zhenning, Bhatnagar, Archit, Qiu, Yiming, Miao, Tongyuan, Kon, Patrick Tser Jern, Xiao, Yunming, Huang, Yibo, Casado, Martin, Chen, Ang

arXiv.org Artificial Intelligence

Cloud infrastructure is the cornerstone of the modern IT industry. However, managing this infrastructure effectively requires considerable manual effort from the DevOps engineering team. We make a case for developing AI agents powered by large language models (LLMs) to automate cloud infrastructure management tasks. In a preliminary study, we investigate the potential for AI agents to use different cloud/user interfaces such as software development kits (SDK), command line interfaces (CLI), Infrastructure-as-Code (IaC) platforms, and web portals. We report takeaways on their effectiveness on different management tasks, and identify research challenges and potential solutions.
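The interface comparison the study describes (SDK, CLI, IaC, web portal) can be illustrated with a minimal agent loop. This is a sketch only: `cloudctl` and the rule-based `plan_command` are hypothetical stand-ins for a real provider CLI and an LLM planner, not the paper's system.

```python
# Minimal sketch of an agent loop that maps a management task to a CLI
# invocation. `plan_command` stands in for an LLM call; the `cloudctl`
# commands are illustrative, not a specific provider's interface.
import shlex

def plan_command(task: str) -> list:
    """Hypothetical stand-in for an LLM that emits a CLI command for a task."""
    playbook = {
        "list instances": "cloudctl compute instances list",
        "create bucket": "cloudctl storage buckets create demo-bucket",
    }
    return shlex.split(playbook[task])

def run_task(task: str, dry_run: bool = True) -> str:
    argv = plan_command(task)
    if dry_run:  # agents should surface the command before mutating infra
        return "DRY RUN: " + " ".join(argv)
    # subprocess.run(argv, check=True)  # real execution path
    return "executed"

print(run_task("list instances"))
```

The dry-run gate reflects a common safety pattern for agents that touch live infrastructure: plan first, execute only after confirmation.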


LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience

Jha, Nimesh, Lin, Shuxin, Jayaraman, Srideepika, Frohling, Kyle, Constantinides, Christodoulos, Patel, Dhaval

arXiv.org Artificial Intelligence

This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service enables efficient anomaly detection in complex data streams, supporting proactive identification and resolution of issues. Furthermore, it presents an innovative approach to anomaly modeling in cloud infrastructure by utilizing Large Language Models (LLMs) to understand key components, their failure modes, and behaviors. A suite of algorithms for detecting anomalies is offered in univariate and multivariate time series data, including regression-based, mixture-model-based, and semi-supervised approaches. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. The service has been successfully applied in various industrial settings, including IoT-based AI applications. We have also evaluated our system on public anomaly benchmarks to show its effectiveness. By leveraging it, SREs can proactively identify potential issues before they escalate, reducing downtime and improving response times to incidents, ultimately enhancing the overall customer experience. We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
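A statistics-based univariate detector of the kind the abstract lists can be sketched as a trailing-window z-score test. The window size, threshold, and sample data below are illustrative assumptions, not the service's algorithm.

```python
# Sketch of a simple univariate anomaly detector: flag a point when its
# z-score against the trailing window exceeds a threshold.
import statistics

def detect_anomalies(series, window=5, z_thresh=3.0):
    """Return a boolean flag per point; early points lack history and pass."""
    flags = []
    for i, x in enumerate(series):
        hist = series[max(0, i - window):i]
        if len(hist) < window:
            flags.append(False)
            continue
        mu = statistics.fmean(hist)
        sd = statistics.pstdev(hist) or 1e-9  # guard against a flat window
        flags.append(abs(x - mu) / sd > z_thresh)
    return flags

data = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
print(detect_anomalies(data))  # only the spike at index 7 is flagged
```

Production services layer model selection, seasonality handling, and multivariate methods on top of this basic idea, as the abstract's algorithm suite suggests.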


AI-Driven Resource Allocation Framework for Microservices in Hybrid Cloud Platforms

Barua, Biman, Kaiser, M. Shamim

arXiv.org Artificial Intelligence

The increasing demand for scalable, efficient resource management in hybrid cloud environments has led to the exploration of AI-driven approaches to dynamic resource allocation. This paper presents an AI-driven framework for allocating resources among microservices in hybrid cloud platforms. The framework employs reinforcement learning (RL) to optimize resource utilization, reducing costs and improving performance, and integrates AI models with cloud management tools to address dynamic scaling and cost-efficient, low-latency service delivery. The RL model continuously adjusts the resources provisioned to each microservice and predicts future consumption trends to minimize both under- and over-provisioning. Preliminary simulation results indicate that AI-driven provisioning can reduce expenditure by up to 30-40% compared with manual provisioning and threshold-based auto-scaling. Resource-utilization efficiency is estimated to improve by 20-30%, with a corresponding 15-20% latency reduction during peak demand. The study compares the AI-driven approach with existing static and rule-based resource allocation methods, demonstrating that the new model outperforms them in flexibility and real-time responsiveness. The results indicate that reinforcement learning can further improve the optimization of hybrid cloud platforms, offering a 25-35% gain in cost efficiency and better scalability for microservice-based applications. The proposed framework is a robust, scalable solution for managing cloud resources in dynamic, performance-critical environments.
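The scaling decision described above can be sketched as tabular Q-learning over coarse load levels. The states, actions, reward shaping, and hyperparameters below are illustrative assumptions for a toy model, not the paper's framework.

```python
# Tabular Q-learning sketch for a scaling policy: states are coarse load
# levels, actions adjust provisioned replicas. Reward shaping penalizes
# over-provisioning at low load and under-provisioning at high load.
import random

STATES = ["low", "medium", "high"]
ACTIONS = [-1, 0, 1]  # remove / keep / add a replica

def reward(state, action):
    if state == "low":
        return 1.0 if action == -1 else -0.5
    if state == "high":
        return 1.0 if action == 1 else -0.5
    return 1.0 if action == 0 else -0.5

def train(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES)
        if rng.random() < eps:
            a = rng.choice(ACTIONS)                       # explore
        else:
            a = max(ACTIONS, key=lambda x: q[(s, x)])     # exploit
        s2 = rng.choice(STATES)  # load evolves independently in this toy model
        target = reward(s, a) + gamma * max(q[(s2, x)] for x in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(policy)  # learns scale-down at low load, hold at medium, scale-up at high
```

A real system would replace the hand-coded reward with measured cost and latency signals, and the random load transitions with observed demand traces.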


Reinforcement Learning-Based Adaptive Load Balancing for Dynamic Cloud Environments

Chawla, Kavish

arXiv.org Artificial Intelligence

Efficient load balancing is crucial in cloud computing environments to ensure optimal resource utilization, minimize response times, and prevent server overload. Traditional load balancing algorithms, such as round-robin or least connections, are often static and unable to adapt to the dynamic and fluctuating nature of cloud workloads. In this paper, we propose a novel adaptive load balancing framework using Reinforcement Learning (RL) to address these challenges. The RL-based approach continuously learns and improves the distribution of tasks by observing real-time system performance and making decisions based on traffic patterns and resource availability. Our framework is designed to dynamically reallocate tasks to minimize latency and ensure balanced resource usage across servers. Experimental results show that the proposed RL-based load balancer outperforms traditional algorithms in terms of response time, resource utilization, and adaptability to changing workloads. These findings highlight the potential of AI-driven solutions for enhancing the efficiency and scalability of cloud infrastructures.
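The learn-by-observation idea above can be sketched as an epsilon-greedy policy that estimates per-server response times online and routes traffic to the fastest server. The simulated latencies and parameters are illustrative assumptions, not the paper's method.

```python
# Bandit-style sketch of adaptive load balancing: epsilon-greedy routing
# based on running estimates of each server's response time.
import random

def balance(latency_means, requests=5000, eps=0.1, seed=1):
    rng = random.Random(seed)
    n = len(latency_means)
    est = [0.0] * n    # running mean latency estimate per server
    count = [0] * n
    picks = [0] * n
    for _ in range(requests):
        if rng.random() < eps or 0 in count:
            i = rng.randrange(n)  # explore (and cover the cold start)
        else:
            i = min(range(n), key=lambda j: est[j])  # exploit: lowest latency
        obs = rng.gauss(latency_means[i], 2.0)       # observed response time
        count[i] += 1
        est[i] += (obs - est[i]) / count[i]          # incremental mean update
        picks[i] += 1
    return picks

picks = balance([20.0, 35.0, 50.0])
print(picks)  # the bulk of traffic goes to the fastest server (index 0)
```

Unlike static round-robin, this policy keeps adapting: if a server's observed latency drifts upward, its estimate rises and traffic shifts away automatically.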


Microsoft to invest $1.7bn in AI, cloud infrastructure in Indonesia

Al Jazeera

Microsoft has announced plans to invest $1.7bn in artificial intelligence and cloud services in Indonesia. Under the plans unveiled by Microsoft CEO Satya Nadella, the tech giant will train 840,000 people in Indonesia in the use of AI and provide support for the country's growing ranks of tech developers. The announcement marks the biggest investment by Microsoft in its nearly three-decade history in the Southeast Asian country. Nadella on Tuesday held talks with President Joko Widodo, popularly known as Jokowi, at Jakarta's presidential palace before delivering a keynote speech about AI in the Indonesian capital. "This new generation of AI is reshaping how people live and work everywhere, including in Indonesia," Nadella said on the first stop of a tour of Southeast Asia.


The Digital Insider

#artificialintelligence

Two months ago, Amazon didn't make a single mention of AI on its earnings call (Google and Microsoft mentioned AI dozens of times each). This past week, by contrast, the company's cloud division, Amazon Web Services (AWS), could talk about little else. As announced by Swami Sivasubramanian, vice president of database, analytics, and machine learning at AWS, the company is all over AI with the launch of new large language models (LLMs) and APIs to access them, as well as CodeWhisperer, a GitHub Copilot competitor, and more. It's not that AWS wasn't working on AI before; Amazon has been working with AI for decades. Rather, it's now impossible to ignore AI.


Can healthcare show the way forward for scaling AI?

#artificialintelligence

This article is part of a VB Lab Insights series on AI sponsored by Microsoft and Nvidia. Don't miss additional articles in this series providing new industry insights, trends and analysis on how AI is transforming organizations. Scaling artificial intelligence (AI) is tough in any industry. And healthcare ranks among the toughest, thanks to highly complex applications, scattered stakeholder networks, stringent licensing and regulations, data privacy and security -- and the life-and-death nature of the industry. "If you mis-forecast an inventory level because your AI doesn't work, that's not great, but you'll recover," says Peter Durlach, Executive Vice President and Chief Strategy Officer of Nuance Communications, a conversational AI company specializing in healthcare. "If your clinical AI makes a mistake, like missing a cancerous nodule on an X-ray, that can have more serious consequences." Even with the current willingness of many organizations to fund AI initiatives, many healthcare organizations lack the skilled staff, technical know-how and bandwidth to deploy and scale AI into clinical workflows.